A large expert-curated cryo-EM image dataset for machine learning protein particle picking

Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structures of biological macromolecular complexes. Picking single-protein particles from cryo-EM micrographs is a crucial step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though machine learning and artificial intelligence (AI) based particle picking can potentially automate the process, its development is hindered by lack of large, high-quality labelled training data. To address this bottleneck, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for protein particle picking and analysis. It consists of labelled cryo-EM micrographs (images) of 34 representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). The dataset is 2.6 terabytes and includes 9,893 high-resolution micrographs with labelled protein particle coordinates. The labelling process was rigorously validated through 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of both AI and classical methods for automated cryo-EM protein particle picking.

Overview of Cryo-EM pipeline, from sample preparation to particle recognition.
41, and HydraPicker 42 all encountered similar problems. They usually perform well on the small, standard datasets used to train and test them (e.g., Apoferritin and KLH), but may not generalize well to non-ideal, realistic datasets containing protein particles of irregular and complex shapes, which are generated daily by cryo-EM facilities around the world.
To address this key bottleneck hindering the development of machine learning and AI methods for automated cryo-EM protein particle picking, we created a large dataset of cryo-EM micrographs, named CryoPPP 43 , which includes manually labelled protein particles. The micrographs are associated with 34 representative proteins of diverse sequences and structures that cover a much larger protein particle space than the existing datasets of a few proteins such as Apoferritin and KLH.
In this study, we ensured the inclusion of diverse protein particles by considering various protein types, shape patterns, sizes (ranging from 77.14 kDa to 2198.78 kDa), source organisms, and different variations of the micrographs. Supplementary Table S3 provides additional information on the protein types that were selected for particle labelling. Additionally, we incorporated cryo-EM images with varying defocus values for each EMPIAR ID to include a diverse set of micrographs. The box-whisker plot in Supplementary Fig. S1 shows the diversity of the defocus values in each EMPIAR ID.
Machine learning methods yield the best results when trained on a diverse set of representative data. Hence, we included micrographs containing particles with different physical features and complexities, as shown in Fig. 2. Figure 2A-D depict the most ideal cases of particle picking, in which the protein particles are visible to the naked eye. If machine learning methods are trained solely on these micrographs, they perform relatively well in simple cases but poorly on more challenging micrographs that have a low signal-to-noise ratio. Thus, we selected difficult micrographs that contain both particles and many ice patches, contaminations (Fig. 2B) and carbon regions (Fig. 2C).
In addition to the aforementioned variations in the micrographs, CryoPPP also comprises micrographs containing mono-dispersed particles of uniform size (Fig. 2D) and micrographs with heterogeneous conformations that show varying top, side and inclined views of particles. Other micrographs with sub-optimal particle concentration, clusters of proteins, and non-uniform ice distributions were also included. The detailed features of the 34 diverse sets of cryo-EM micrographs in CryoPPP are described in Supplementary Table S3.
The quality of the manually labeled particles of selected proteins was eventually validated rigorously. This validation was done against particles originally labelled by the authors who generated the cryo-EM data (called gold standard here) by both 2D particle class validation and 3D cryo-EM density map validation. The quality of our manual annotation is on par with the annotations provided by the experts who created the data in the first place, which confirms our manual particle labelling process is effective. Therefore, we believe CryoPPP is a valuable resource for training and testing machine learning and AI methods for automated protein particle picking.

Methods
CryoPPP was created through a series of steps, as shown in Fig. 3. We first crawled the data from the EMPIAR website using Python API and FTP scripts run through Bash scripting. We filtered out microscopic images of various non-single-protein particles (e.g., bacteria, filaments, RNA, protein fibrils, virus-like particles) and retained only high-resolution micrographs acquired by the cryo-EM technique for manual particle labeling.
After importing the micrographs with all the physicochemical parameters gathered from the corresponding published literature, we performed motion correction and Contrast Transfer Function (CTF) estimation for them. After preparing the micrographs, two human experts separately conducted an initial manual particle selection, using low-pass filter values and appropriate particle diameter settings, on approximately 20 out of about 300 micrographs per selected protein (EMPIAR ID).
The expert-picked particles were cross-validated and then went through 2D particle classification. The best particles, judged by resolution, particle count, and visually sensible 2D classes, were selected and used for template-based particle picking and further human inspection. After iterating the 2D classes from template-based picking and human inspection, we ultimately obtained the final set of highly confident protein particles as ground truth and exported them to files in star, csv and mrc formats. The first two file types (.star and .csv) contain the coordinates of the protein particles and the last (.mrc) stores particle stacks. The process of creating CryoPPP in Fig. 3 is described in detail in the following sections.
EMPIAR metadata collection and filtering. The process of preparing the dataset began with collecting metadata about cryo-EM image datasets in EMPIAR. Data collection scripts that use Python API and FTP protocols were used to automatically download the metadata from the EMPIAR web portal 31. The metadata includes the EMPIAR ID of each cryo-EM dataset of a protein, the corresponding Electron Microscopy Data Bank (EMDB) ID, Protein Data Bank (PDB) ID, size of dataset, resolution, total number of micrographs, image size/type, pixel spacing, micrograph file extension, gain/motion correction file extension, FTP path for micrograph/gain files, Globus path for micrograph/gain files, and publication information.
Following the metadata collection, the individual cryo-EM datasets in the collection were filtered as depicted in Fig. 4 (Steps 1-5). First, we only chose EMPIAR IDs (datasets) that have their volume maps deposited in EMDB. From the chosen EMPIAR datasets, we only selected ones that had corresponding protein structures in the Protein Data Bank (PDB).
To ensure high data quality, we then retained only the EMPIAR datasets whose resolution was better than 4 Angstrom (Å). We observed that there were some redundant EMPIAR datasets (e.g., EMPIAR ID: 10709 & 10707, EMPIAR ID: 10899 & 10897) that correspond to the same biomolecule with the same PDB and EMDB IDs. Hence, we eliminated those duplicate entries. After removing duplicate records, we selected only EMPIAR datasets that contained micrographs of protein particles, excluding other biological samples such as viruses. This filtering step required some literature study of individual EMPIAR datasets. The motion correction and gain correction files for the selected datasets were extracted from EMPIAR if required. The final list of metadata includes 335 EMPIAR entries, 34 of which were used for manual labelling. Refer to the EMPIAR_meta-data_335.xlsx file in CryoPPP for further information about the list of 335 datasets of 355 proteins.
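The five filtering steps above can be sketched as a short Python routine. This is only an illustration under stated assumptions: the `EmpiarEntry` record, its field names, and the `sample_type` label (which stands in for the manual literature review) are hypothetical and do not reflect the actual columns of the CryoPPP metadata spreadsheet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EmpiarEntry:
    # All field names are hypothetical placeholders for the collected metadata.
    empiar_id: str
    emdb_id: Optional[str]
    pdb_id: Optional[str]
    resolution_angstrom: float
    sample_type: str  # assumed label from literature review, e.g. "protein" or "virus"


def filter_entries(entries: List[EmpiarEntry], max_resolution: float = 4.0) -> List[EmpiarEntry]:
    """Apply the filtering steps in the order described in the text."""
    seen = set()
    kept = []
    for e in entries:
        if not e.emdb_id:  # step 1: volume map deposited in EMDB
            continue
        if not e.pdb_id:  # step 2: corresponding structure in PDB
            continue
        if not e.resolution_angstrom < max_resolution:  # step 3: better than 4 Å
            continue
        key = (e.emdb_id, e.pdb_id)
        if key in seen:  # step 4: drop duplicates sharing EMDB and PDB IDs
            continue
        seen.add(key)
        if e.sample_type != "protein":  # step 5: keep protein particles only
            continue
        kept.append(e)
    return kept
```

Keeping the first entry of each duplicated (EMDB, PDB) pair is an arbitrary choice made for this sketch; the text does not say which member of a redundant pair was retained.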
Manual particle labeling. Manually picking particles in cryo-EM micrographs through the graphical interfaces of cryo-EM analysis tools such as CryoSPARC 16, EMAN2 14 and RELION 15 is very time consuming. We carefully imported large micrographs, carried out motion correction, and estimated the CTF, taking particular care when gain files had to be flipped and rotated so that their array shape matched that of the micrographs. Furthermore, it takes a lot of disk space to store the labelled particle data together with the corresponding micrographs and particle stack files. Therefore, we chose 34 representative EMPIAR datasets out of the 335 entries selected in the previous section for manual particle labelling to create the CryoPPP dataset, considering diverse particle sizes/shapes, density distribution, noise level, and ice and carbon areas. Moreover, proteins from a wide range of categories, such as metal binding, transport, membrane, nuclear, signaling, and viral proteins, were selected. See Supplementary Tables S1 and S3 for more details about the 34 proteins (cryo-EM datasets). Most of the pre-processing, manual particle labelling, real-time quality assessment, and decision-making workflows were performed using CryoSPARC v4.1.1 16, EMAN2 14, and RELION 4.0 15.
CryoPPP includes a total of 9,893 micrographs (~300 cryo-EM images per selected EMPIAR dataset). We labelled ~300 micrographs per EMPIAR dataset because using all the micrographs in each dataset would result in 32.993 TB of data, making it too large to store, transfer, and use for most machine learning tasks.
There is no rule of thumb to determine the number of particles required for training machine learning methods to achieve optimal performance. It varies from protein to protein. It is also worth noting that the resolution of a 3D density map depends not only on the number of picked 2D particles but also on the inclusion of a diverse range of particle orientations, which can significantly improve the resolution. CryoPPP contains 2.7 million particles of 34 different proteins, which should be adequate for machine learning algorithms to learn relevant particle features. Typically, we used approximately 300 micrographs per protein to pick particles. However, if adequate particle views were not extracted from the first 300 micrographs, a larger number of micrographs were used.
Importing movies. This is the crucial first step of particle labeling. For each EMPIAR dataset, we import two inputs: micrographs and gain reference (motion correction files). We analyzed the description of the EM data acquisition and grid preparation for each dataset in order to collect important information such as raw pixel size (Å), acceleration voltage (kV), spherical aberration (mm), and total exposure dose (e/Å²) for the micrographs in the dataset.
Furthermore, we obtained the gain reference for micrographs whose motion had not been corrected before. We used e2proc2d, a generic 2-D image processing program in EMAN2 14, to convert the different formats of motion correction files (e.g., .dm4, .tiff, .dat, etc.) to .mrc files, since CryoSPARC accepts only the .mrc extension. Then, based on the microscope camera settings and how the data was acquired during the imaging process, we applied geometrical transformations (flipping the gain reference and defect file left-to-right/top-to-bottom (in the x/y axis) or rotating the gain reference clockwise/anti-clockwise by certain degrees) relative to the image data. Supplementary Table S1 contains the details of the input parameters for each EMPIAR ID. After importing movies and motion correction files, we proceeded to the job inspection panel of CryoSPARC to ensure that all input settings and loaded micrographs were correct.
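The geometrical transformations of the gain reference can be illustrated with NumPy. This is a minimal sketch, not the CryoSPARC implementation: the function names are hypothetical, rotations are restricted to multiples of 90 degrees, and the order in which flips and rotations are applied here is an assumption that should be checked against the per-dataset settings in Supplementary Table S1.

```python
import numpy as np


def transform_gain(gain, flip_x=False, flip_y=False, rot90_ccw=0):
    """Flip and/or rotate a 2D gain reference array.

    flip_x: mirror left-to-right; flip_y: mirror top-to-bottom;
    rot90_ccw: number of 90-degree anti-clockwise turns (negative = clockwise).
    The flip-then-rotate order is an assumption made for this sketch.
    """
    out = np.asarray(gain)
    if flip_x:
        out = np.fliplr(out)
    if flip_y:
        out = np.flipud(out)
    if rot90_ccw % 4:
        out = np.rot90(out, k=rot90_ccw)
    return out


def matches_frame(gain, frame_shape):
    """Check that the gain reference has the same array shape as a movie frame."""
    return gain.shape == tuple(frame_shape)
```

A quick shape check such as `matches_frame(transform_gain(gain, rot90_ccw=1), frame.shape)` catches the most common mistake, a gain file whose rows and columns are swapped relative to the movie frames.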
Patch motion correction. When specimens are exposed to an electron beam, the mobility of sample molecules (protein particles) during data acquisition can affect the overall quality of electron micrographs and lower the final resolution 44 . Hence, it is necessary to correct the movement of particles (referred to as 'beam-induced motion').
The causes behind this motion can be categorized into two types: (1) Motion from the microscope: it is caused by stage drift and usually arises from the small amount of vibration left over after the stage has been aligned to a new position 45. It moves the sample relative to the beam and optical axis. This motion is quite jagged in time, with sharp accelerations or twitches, but is consistent: the entire image moves in the same direction over time. (2) Motion from sample deformation: this motion is caused by the energy deposited into the ice by the beam, or by energy already trapped in it due to strain locked in during freezing, which is eventually released during image capture. As the electrons pass through the sample, the energy from the beam and the temperature change cause the ice to physically deform and bend. That deformation is often smoother over time, but it can be highly anisotropic in space. In this case, various parts of the same image can move in different directions at the same time.
Both motions must be estimated and corrected to obtain high-resolution reconstructions from the data. In the patch-based motion correction step, we corrected both global motion (stage drift) and local motion (beam-induced anisotropic sample deformation) for the micrographs using CryoSPARC (as shown in Fig. 5A). In the anisotropic deformation plot in Fig. 5B, each red circle indicates the center of a single "patch" of the image, and the curves emerging from each circle show the motion of that portion of the sample. The motion of adjacent patches is correlated: they move somewhat similarly to one another. To prevent the fit from being distorted by random noise in the micrograph, the patch motion correction algorithm imposes smoothness constraints on the motion. Figure 5C shows examples of the plots generated by patch motion correction that depict the computed trajectories. The set of plots shows the overall motion correction (an actual trajectory plot, followed by X-motion and Y-motion plots over time). In the overall motion trajectory over X and Y motion (Fig. 5C, Left), each dot represents the sample's position from frame to frame. Here, the x and y axes are in units of pixels at the raw data's pixel size. The sample begins at point (X), moves downward, curves and changes direction toward the top left, and then continues to descend to the left. We apply this trajectory to the input data by shifting each image in reverse of what the motion trajectory suggests and finally averaging the images together. In other words, we track a sample's motion during the exposure to undo it.
While curating CryoPPP, we noticed several factors (protein size, charge, and ice thickness) potentially influencing the scale of local motion in cryo-EM micrographs.
Larger proteins tend to exhibit slower local motion than smaller ones, due to their increased mass and higher structural complexity. The charge distribution on a protein surface can also affect the scale of local motion, with highly charged proteins exhibiting more restricted motion due to the electrostatic interactions with the surrounding ice, while relatively neutral proteins may be more mobile. Additionally, we observed that the thickness of the ice layer surrounding a protein can influence local motion; thicker ice layers can provide more structural stability but may introduce more noise and distortion in the micrograph.
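The idea of undoing a global motion trajectory (shifting each frame in reverse and then averaging) can be sketched as follows. This is a deliberately simplified illustration: it handles only rigid whole-frame motion with shifts rounded to whole pixels, uses wrap-around shifting via `np.roll`, and the function name is hypothetical. Patch-based correction as performed in CryoSPARC additionally models smooth, spatially varying local motion, which is not reproduced here.

```python
import numpy as np


def correct_rigid_motion(frames, trajectory):
    """Average movie frames after undoing a per-frame global shift.

    frames: array of shape (n_frames, height, width).
    trajectory: sequence of (dx, dy) sample positions in pixels, one per frame.
    Each frame is shifted by the negative of its trajectory entry, so the
    sample lands at the same position in every frame before averaging.
    """
    frames = np.asarray(frames, dtype=float)
    aligned = np.empty_like(frames)
    for i, (dx, dy) in enumerate(trajectory):
        # Rows correspond to y and columns to x; np.roll wraps at the edges,
        # which is acceptable for a sketch but not for production use.
        shift = (-int(round(dy)), -int(round(dx)))
        aligned[i] = np.roll(frames[i], shift=shift, axis=(0, 1))
    return aligned.mean(axis=0)
```

Sub-pixel shifts would normally be applied in Fourier space or by interpolation; whole-pixel rolling is used here only to keep the example dependency-free.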
Patch-based CTF estimation. The contrast of images captured in the electron microscope is affected by imaging defocus and lens aberrations, which are adjusted by microscope operators to enhance the contrast. The relationship between lens aberrations and the contrast in the image is defined by the CTF. It describes how information is transferred as a function of spatial frequency. It is important to estimate the CTF, which is then corrected during the 2D particle classification and 3D reconstruction steps. Otherwise, the resulting reconstruction will have extremely low resolution. A full treatment of the effects of the CTF usually proceeds in two stages: CTF estimation and CTF correction. In CryoSPARC, which was used in this work, the CTF model is given by Eq. 1.

\[ \mathrm{CTF}(f) = -\cos\left( \pi \Delta z \lambda_e f^2 - \frac{\pi}{2} C_s \lambda_e^3 f^4 + \Phi \right) \tag{1} \]
where $\Delta z$ is defocus, $\lambda_e$ is the wavelength of the incident electrons, $C_s$ is spherical aberration, and $f$ is spatial frequency. $\Phi$ represents a phase shift factor.
Most cryo-EM samples are not 'flat'. Before a sample is frozen, particles tend to concentrate around the air-water interfaces, and the ice surface itself is usually not flat 46,47. Because defocus has an impact on the CTF, distinct particles can have various defoci and hence various CTFs within a single image. To address this problem, CryoSPARC offers a patch-based CTF estimator that analyzes numerous regions of a micrograph to calculate a "defocus landscape".
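To make the roles of defocus, electron wavelength, spherical aberration, spatial frequency and phase shift concrete, the following sketch evaluates a 1D CTF curve. It assumes the commonly used form CTF(f) = -cos(π Δz λ f² - (π/2) C_s λ³ f⁴ + Φ); sign conventions differ between packages, so this should be read as an illustration rather than CryoSPARC's exact implementation. The function names and default parameter values (300 kV, 2.7 mm) are illustrative assumptions; the wavelength formula is the standard relativistic expression.

```python
import numpy as np


def electron_wavelength_angstrom(voltage_kv):
    """Relativistic electron wavelength in Å for an acceleration voltage in kV."""
    v = voltage_kv * 1e3  # volts
    return 12.2643 / np.sqrt(v + 0.97845e-6 * v ** 2)


def ctf_1d(freq, defocus_angstrom, voltage_kv=300.0, cs_mm=2.7, phase_shift=0.0):
    """Evaluate the assumed CTF model at spatial frequencies given in 1/Å.

    defocus_angstrom: defocus (Δz) in Å; cs_mm: spherical aberration in mm;
    phase_shift: additional phase shift (Φ) in radians.
    """
    lam = electron_wavelength_angstrom(voltage_kv)
    cs = cs_mm * 1e7  # convert mm to Å so all lengths share one unit
    f = np.asarray(freq, dtype=float)
    chi = (np.pi * defocus_angstrom * lam * f ** 2
           - 0.5 * np.pi * cs * lam ** 3 * f ** 4
           + phase_shift)
    return -np.cos(chi)
```

For example, `ctf_1d(np.linspace(0, 0.3, 500), defocus_angstrom=20000.0)` gives the oscillating curve for a defocus of 2 µm; larger defocus values make the oscillations start at lower spatial frequency, which is why particles at different heights in the ice have different CTFs.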
We performed a 1D search over defocus for every micrograph. Figure 6A depicts the 1D search for a particular micrograph of EMPIAR ID 10737 48. This plot helps identify a particular defocus value that stands out among a variety of other defocus values (x-axis). Patch CTF creates a plot showing how closely the observed power spectrum of the input micrographs and the calculated CTF match. The CTF fit plot in Fig. 6B shows that the computed CTF matches the observed power spectrum up to a resolution of 3.993 Å. The cross-correlation between the observed spectrum and the calculated CTF is depicted by the cyan line in the plot. The vertical green line in the plot marks the frequency at which the cross-correlation falls below CryoSPARC's threshold of 0.3 for a successful fit.
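The way a fit resolution follows from the cross-correlation curve and the 0.3 threshold can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, the inputs are taken to be a sampled curve of cross-correlation against spatial frequency, and the first sample below the threshold is used without interpolation, which may differ from how CryoSPARC reports the value.

```python
import numpy as np


def ctf_fit_resolution(freqs, cross_corr, threshold=0.3):
    """Return the resolution (Å) at which the CTF fit first drops below threshold.

    freqs: increasing, strictly positive spatial frequencies in 1/Å.
    cross_corr: cross-correlation between observed spectrum and fitted CTF.
    If the curve never drops below the threshold, the highest sampled
    frequency is used.
    """
    freqs = np.asarray(freqs, dtype=float)
    cc = np.asarray(cross_corr, dtype=float)
    below = np.nonzero(cc < threshold)[0]
    idx = below[0] if below.size else len(freqs) - 1
    return 1.0 / freqs[idx]
```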
We executed the patch CTF job to obtain the output micrographs with data on their average defocus and the defocus landscape. When particles were extracted, this data was automatically used to assign each particle a local defocus value based on its position in the landscape.
Manual particle picking. After performing motion correction and CTF estimation, we manually picked particles interactively from the aligned/motion-corrected micrographs using the 'Manual Picker' job in CryoSPARC, with the goal of creating particle templates for auto-picking. Depending on the size and shape of the protein particles, we adjusted the box size and the particle diameter. Since picking particles on raw micrographs is extremely difficult, we tweaked the 'Contrast Intensity Override' while viewing micrographs in order to obtain the most distinctive view for picking particles.
It is particularly challenging to manually pick particles from micrographs with smaller defocus levels. Figure 7 shows micrographs from the same dataset with different defocus levels for EMPIAR 10532 48. It is worth pointing out that the task of particle picking becomes significantly more challenging for AI methods when the micrographs have a low signal-to-noise ratio or contain numerous ice patches, carbon areas, aggregated particles, and non-uniform ice distribution, as illustrated in Fig. 2B,C,E,G. Hence, to generate comprehensive templates, we manually picked particles from diverse micrographs covering a wide range of defocus values (see Supplementary Fig. S1) and CTF fit values.
As manual picking was very time intensive, we selected a subset of micrographs (around 20 micrographs of each EMPIAR dataset) for manually picking initial particles for the subsequent template-based particle picking.
To ensure data reproducibility, the intermediate star files of manually expert-picked particles are deposited in CryoPPP and can be found under the ground-truth subdirectory. Additional information regarding the total number of manually picked particles, number of micrographs considered for manual picking, particle diameter, angular sampling, and minimum separation distance of particles are provided in the Supplementary Table S2.
Forming and selecting best 2D particle classes. The manually picked particles went through the 2D classification step. This step helped to classify the picked particles into several 2D classes to facilitate stack cleaning and junk particles removal. To analyze the distribution of views within the dataset qualitatively, we specified a specific number of 2D classes. By doing this, we investigated the particle quality and removed junk particle classes, which ultimately facilitated the selection of good particle classes.
We specified the initial Classification Uncertainty Factor (ICUF) and maximum alignment resolution to align particles to the classes with 40 expectation maximization (EM) iterations. The diameter of the circular mask that was applied to the 2D classes at each iteration was controlled using the circular mask diameter in the case of crowded particles.
After the 2D classes were formed, we selected the best particle classes interactively to remove junk. Figure 8 shows an example of 2D classification and selection of highly confident particles for EMPIAR ID 10017 49. We used three diagnostic measures to select the 2D classes: the resolution (Å) of a class, the number of particles in a class (the higher, the better), and the visual appearance of a class. Considering only the number of particles in a class is not sufficient because some classes containing a small number of particles may represent a unique view of the protein.
Refer to the Supplementary Table S2 for additional details regarding the number of 2D classes in each EMPIAR dataset, window inner and outer radii, recenter mask threshold, number of iterations to anneal sigma as well as other relevant thresholds and parameters.
Template based picking and manual inspection and extraction of particles. After the best particle classes were selected and exported, we used a template generated from the 'Forming and Selecting Best 2D' step to pick more particles. The process was iterative, meaning that the output of a round of 'template-based picking and inspection' was fed back into the '2D class formation' step to form and select the best 2D classes under human inspection. This process was repeated until we acquired high-resolution particles that include all possible particle projection angles. The final templates with green boxes (as shown in Fig. 8) were used to auto-pick particles from micrographs. With CryoSPARC's Template Picker, we used high-resolution templates to precisely select particles that matched the geometry of the target structure. Figure 9A shows manually picked particles for EMPIAR-10017 49 that serve as templates for template-based picking, which eventually yields template-picked particles ready for human inspection, as shown in Fig. 9B. We specified constraints such as the particle diameter in angstrom (see Supplementary Table S2 for more information) and a minimum distance between particles to generate the templates based on the SK97 sampling algorithm 34, to remove any signals from the corners and prevent crowding. We observed that blob-based picking in RELION required the minimum and maximum allowed diameters of the blobs, whereas defining a single value for the particle diameter works well in CryoSPARC's template particle picking step.
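The effect of a minimum separation distance on crowded picks can be sketched with a simple greedy filter. This is not CryoSPARC's algorithm; it is a hypothetical illustration in which candidate picks carry a score, higher-scoring picks are kept first, and any candidate closer than the minimum distance to an already kept pick is discarded.

```python
import numpy as np


def enforce_min_separation(coords, scores, min_dist):
    """Greedily keep picks so that no two kept picks are closer than min_dist.

    coords: sequence of (x, y) pick centers; scores: one score per pick
    (higher is better); min_dist: minimum allowed center-to-center distance,
    in the same units as coords. Returns the sorted indices of kept picks.
    """
    coords = np.asarray(coords, dtype=float)
    order = np.argsort(scores)[::-1]  # best-scoring candidates first
    kept = []
    for i in order:
        far_enough = all(
            np.hypot(*(coords[i] - coords[j])) >= min_dist for j in kept
        )
        if far_enough:
            kept.append(int(i))
    return sorted(kept)
```

The pairwise loop is quadratic in the number of picks, which is fine for a single micrograph; a spatial index would be the natural replacement for larger inputs.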
Finally, the particles obtained by the template picking went through the manual inspection step, where we examined and modified picks using various thresholds. We adjusted the lowpass filter, optimum power score range, and normalized cross-correlation (NCC) score (see Supplementary Table S2) to improve the visibility of the picks. In this step, we removed false positive particles as depicted in Fig. 12B. While efforts were made to minimize false negatives during particle picking, it is important to recognize that the complete elimination of false negatives is impossible in one round of annotation. The 2D colored histogram plots as depicted in Fig. 12 were used to scrutinize micrograph median pick scores versus defocus for extracting the coordinates of high-quality protein particles. Hence, we have strived to optimize particle picking, aiming to minimize both false positives and false negatives and thus improving the accuracy and reliability of CryoPPP. We will continue to improve and update CryoPPP throughout its life cycle, considering the input from external users.

Data Records
The CryoPPP dataset 43 consists of 9,893 manually labelled micrographs from 34 diverse, representative cryo-EM datasets of 34 protein complexes selected from EMPIAR. Each EMPIAR dataset, identified by a unique EMPIAR ID, has about 300 cryo-EM images in which the coordinates of protein particles were labeled and cross-validated by two experts aided by software tools. The total size of CryoPPP is 2.6 TB. Additional statistical information about CryoPPP can be found in Table 1.
The full dataset is available at https://github.com/BioinfoMachineLearning/cryoppp. For researchers who have limited disk space, a much smaller light version of CryoPPP, called CryoPPP_Lite, can also be downloaded from the website. CryoPPP_Lite includes the micrograph files in 8-bit JPG format and the particle ground truth files, which together need only 121 GB of disk space and are therefore easier to store and transfer.
Each data folder (named by its corresponding EMPIAR ID) includes the following information: original micrographs (either motion-corrected or not), gain motion correction file, new motion-corrected micrographs (if the original micrographs are not motion-corrected), ground truth labels (manually picked particles), and particle stack. The directory structure of each data entry is illustrated in Fig. 10. The data in each directory is described as follows. It is worth noting that if the original micrographs were not motion-corrected, we applied motion correction to them to create their motion-corrected counterparts.
Raw micrographs. These are the two-dimensional projections of the protein particles in different orientations, stored in different image formats (MRC, TIFF, EER, TIF, etc.). They can be considered as the photos taken by the cryo-EM microscope. Original micrographs are from EMPIAR and can be either motion corrected or not. If an entry has a 'gain' folder, it includes both raw non-motion-corrected micrographs and their motion-corrected counterparts created by us. Users should use the motion-corrected micrographs as input for machine learning tasks. The scripts for the motion correction are available at CryoPPP's GitHub website.

Motion correction (gain files). This folder contains motion correction files (if motion in the original micrographs was not corrected before) stored in different formats such as dm4 and mrc. They are used to correct both global motion (stage drift) and local motion (beam-induced anisotropic sample deformation) that occur when specimens (protein particles) are exposed to the electron beam during imaging. Correcting the motion enables high-resolution reconstruction from the data.
Particle stack. The particle stack comprises the mrc files (with names corresponding to the individual micrographs' filenames) of manually picked protein particles (ground truth labels). These are three-dimensional grids of voxels with values corresponding to electron density (i.e., a stack of 2D images). To browse and examine these files, use EMAN2 14, UCSF Chimera 50, or UCSF ChimeraX 51.
Ground truth labels. Ground truth data contain the star and CSV files for both all true particles (positives) and some typical false positives (e.g., ice contaminations, aggregates, and carbon edges). The positive star (and corresponding CSV) files hold the ground truth positions of the picked particles, combined in a single file for all ~300 micrographs per EMPIAR ID, while the negative star file holds the positions of the false positive particles. These star files contain information such as X-coordinate, Y-coordinate, Angle-Psi, Origin X (Ang), Origin Y (Ang), Defocus U, Defocus V, Defocus Angle, Phase Shift, CTF B Factor, Optics Group, and Class Number of the particles.
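As an illustration of how particle coordinates might be read from a star file, the sketch below parses the `loop_` tables of a STAR-format text and extracts X/Y coordinates. It is a minimal parser written for this example, not the loader used to build CryoPPP. The column labels `rlnCoordinateX` and `rlnCoordinateY` follow the RELION naming convention; whether the CryoPPP star files use exactly these labels is an assumption, so both are exposed as parameters.

```python
def parse_star_loops(text):
    """Parse every loop_ table in a STAR file into (labels, rows) pairs.

    Rows are dicts mapping a label (without the leading underscore) to the
    raw string value. Key-value pairs outside loops are ignored.
    """
    loops = []
    labels, rows, in_header = None, None, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "loop_":
            labels, rows = [], []
            loops.append((labels, rows))
            in_header = True
        elif line.startswith("data_"):
            labels = None  # a new data block ends the current loop
        elif labels is not None and in_header and line.startswith("_"):
            labels.append(line.split()[0][1:])
        elif labels is not None:
            in_header = False
            values = line.split()
            if len(values) == len(labels):
                rows.append(dict(zip(labels, values)))
    return loops


def particle_coordinates(text, x_label="rlnCoordinateX", y_label="rlnCoordinateY"):
    """Return (x, y) tuples from the first loop that has both coordinate labels."""
    for labels, rows in parse_star_loops(text):
        if x_label in labels and y_label in labels:
            return [(float(r[x_label]), float(r[y_label])) for r in rows]
    return []
```

Calling `particle_coordinates(open(path).read())` on a positive star file would then give the labelled particle centers, provided the assumed labels match the file.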
To ensure the reproducibility of the dataset, we have included the intermediate data (Fig. 10, ground_truth >> intermediate_data) along with the intermediate metadata (presented in Supplementary Table S2). The intermediate data comprise the star files of manually expert-picked particles, which were utilized to construct templates for further particle selection.
In addition, there is a subdirectory called particle_coordinates inside ground_truth, which contains csv files with the same names as the raw micrographs. Each file lists the X-coordinate and Y-coordinate of every protein particle in that micrograph, along with the particle diameter and other physico-chemical information.
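For machine learning use, per-micrograph coordinates and a particle diameter are typically turned into fixed-size image crops. The sketch below shows one way to do this with NumPy. It is an assumption-laden illustration: the function name is hypothetical, coordinates are taken to be particle centers in pixels with x as the column index and y as the row index, the diameter is assumed to be given in pixels, and picks too close to the image border are simply skipped.

```python
import numpy as np


def extract_boxes(micrograph, centers, diameter_px, pad=1.0):
    """Crop square boxes around particle centers from a 2D micrograph.

    centers: iterable of (x, y) pixel coordinates (x = column, y = row).
    diameter_px: particle diameter in pixels; pad scales the box size.
    Returns an array of shape (n_boxes, box, box).
    """
    box = int(round(diameter_px * pad))
    half = box // 2
    height, width = micrograph.shape
    boxes = []
    for x, y in centers:
        x0 = int(round(x)) - half
        y0 = int(round(y)) - half
        # Skip picks whose box would extend beyond the micrograph.
        if x0 < 0 or y0 < 0 or x0 + box > width or y0 + box > height:
            continue
        boxes.append(micrograph[y0:y0 + box, x0:x0 + box])
    if not boxes:
        return np.empty((0, box, box), dtype=micrograph.dtype)
    return np.stack(boxes)
```

If the diameter in the csv files is stored in Å rather than pixels, it would need to be divided by the pixel size before being passed in.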

Technical Validation
To ensure that the dataset is of high quality, we applied numerous validations and statistical analyses throughout the data curation process.
Fig. 10 The directory structure of each expert-labelled data entry of CryoPPP. The directory contains micrographs, motion correction files, particle stacks, and ground truth labels (manually picked particles). The blocks with numbers on the left represent corresponding EMPIAR IDs.
Quality of data. As noted in Fig. 4, we ensure that the dataset exclusively contains micrographs obtained using the cryo-EM technique. Only the EMPIAR IDs with resolution better than 4 Å are chosen for creating refined protein metadata and ground truth labels of protein particles. The detailed quality control procedures are described as follows.

Distribution of data. Diverse protein types.
To be inclusive and ensure unbiased data generation, we selected the cryo-EM data of 34 different, diverse protein types (e.g., membrane, transport, metal binding, signalling, nuclear, viral proteins) to manually label protein particles, which can enable machine learning methods trained on them to work for many different proteins in the real-world. We selected the datasets covering different particle size, distribution density, noise level, ice and carbon areas, and particle shape as they are influential in particle picking.
Diverse micrographs within the same protein type. The variance in micrographs' defocus values within an EMPIAR dataset is not accounted for by the majority of particle picking methods. This defocus variation causes the same particles to appear differently, altering the noise statistics of each micrograph. This makes it challenging to create thresholds to select high-quality particles. Figure 7 shows an example of how different defocus values impact the appearance and quality of cryo-EM images in the same EMPIAR dataset. Therefore, while manually picking the particles, we included a wide variety of defocus levels and CTF fit values.
We recorded the correlation between defocus levels and the pick scores/power scores (shown in Fig. 11 for EMPIAR-10590 51) to assess the shape and density of a particle candidate independently. After calibration, the scores of each particle are recorded relative to the calibration line, and these values are used to define thresholds on the parameters.

Reliability of ground truth annotations.
Legitimacy of importing micrographs and motion correction data. All the input parameters used to prepare for loading micrographs into the CryoSPARC system were gathered from the appropriate literature. We adhered to the standards in the publications, including data acquisition and imaging settings such as the microscope used, defocus range, spherical aberration, pixel spacing, acceleration voltage, electron dose, and the correct usage of motion correction. Based on the microscope settings during the imaging process, we applied appropriate geometrical transformations. The defect files and the motion-correction files were flipped left-to-right or top-to-bottom and rotated by specific degrees in the clockwise or anti-clockwise direction as required. All these factors were thoroughly investigated and used during the data loading process in CryoSPARC.
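The flips and rotations described above can be sketched on a NumPy array. This is an illustration only: it assumes rotations in multiples of 90 degrees, and the function and parameter names are hypothetical rather than CryoSPARC import options.

```python
import numpy as np

def orient_reference(image, flip_lr=False, flip_ud=False, quarter_turns_ccw=0):
    """Flip and rotate a 2D reference image (e.g., a defect or gain file).

    Assumption: rotations are multiples of 90 degrees. A positive
    quarter_turns_ccw rotates anti-clockwise; a negative value rotates
    clockwise.
    """
    out = np.asarray(image)
    if flip_lr:
        out = np.fliplr(out)   # mirror left-to-right
    if flip_ud:
        out = np.flipud(out)   # mirror top-to-bottom
    return np.rot90(out, k=quarter_turns_ccw)

# Tiny hypothetical image used only to show the effect of each operation.
toy = np.array([[1, 2],
                [3, 4]])
mirrored = orient_reference(toy, flip_lr=True)
clockwise = orient_reference(toy, quarter_turns_ccw=-1)
```

The order of operations matters (a flip followed by a rotation is generally not the same as the reverse), so the transformation chosen for a dataset should follow the acquisition settings reported in its publication.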
Inspection of picked protein particles. The picked particles were inspected using a 2D colored histogram, as shown in Fig. 12. A particle of interest would have an intermediate local power score and a high template correlation (indicating that its shape closely matches its template). Low local power scores indicate empty ice patches, even though such patches might resemble the template. Conversely, very high local power scores indicate carbon edges, aggregates, contaminants, and other objects with excessive densities that resemble particles.
As shown in Fig. 12 (B, bottom), we interactively specified the upper and lower thresholds for both the power score and the NCC score for each dataset, improving the accuracy of the manual particle picking.
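The effect of such upper and lower thresholds can be sketched as a simple mask over per-particle scores. The function name, score values, and threshold values below are hypothetical; in this work the thresholds were chosen interactively per dataset.

```python
import numpy as np

def select_particles(ncc, power, ncc_range, power_range):
    """Return a boolean mask of candidates whose NCC and power scores both
    fall inside the given (lower, upper) ranges."""
    ncc = np.asarray(ncc, dtype=float)
    power = np.asarray(power, dtype=float)
    ncc_ok = (ncc >= ncc_range[0]) & (ncc <= ncc_range[1])
    power_ok = (power >= power_range[0]) & (power <= power_range[1])
    return ncc_ok & power_ok

# Hypothetical scores for four candidates:
# a good particle, an empty ice patch, a poor template match, a carbon edge.
ncc_scores = np.array([0.90, 0.80, 0.20, 0.85])
power_scores = np.array([500.0, 50.0, 400.0, 5000.0])
mask = select_particles(ncc_scores, power_scores,
                        ncc_range=(0.5, 1.0), power_range=(100.0, 1000.0))
```

Only the first candidate passes: the second has too little local power (empty ice), the third correlates poorly with the template, and the fourth has excessive density.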
Cross-validation by two human experts. The particles picked by the two cryo-EM experts were compared with each other to make sure they are consistent. For example, two EMPIAR IDs, EMPIAR-10028 52 and EMPIAR-10081 53 , with 300 micrographs each (600 cryo-EM micrographs in total) were used in cross-validation. The results of the 2D classes were compared based on the total number of particles in each class, the relative resolution of particles in the class, and the distinct views of the structure of particles. The similar 2D classes obtained by two independent cryo-EM specialists, as shown in Fig. 13, validate the accuracy of the manually labelled particles.
Comparison with existing state-of-the-art AI methods of particle picking. We compared the 3D density maps reconstructed from the particles picked by us with those reconstructed from particles picked by two AI methods, namely Topaz and crYOLO. We used the two methods to predict particles for two datasets, EMPIAR 10081 and EMPIAR 10345, each containing 300 micrographs. Subsequently, ab initio density map reconstruction and homogeneous refinement were performed using the generated star files containing the picked particles. To avoid any bias in the comparison, we repeated the ab initio 3D reconstruction trials with three different random seeds for each method. The GSFSC resolution results for the three trials for each method are presented in Table 2. CryoPPP outperforms both Topaz and crYOLO.
For EMPIAR 10081, Topaz picked around 100,000 more particles than we did. However, the best resolution of the density map reconstructed from CryoPPP-picked particles in the three trials is 4.59 Å, substantially better than the 6.75 Å of Topaz. This suggests that Topaz may have selected a substantial number of false positives (e.g., ice contamination) or may have selected the same protein particles multiple times with slightly different center positions. crYOLO, on the other hand, picked significantly fewer protein particles than we did, which resulted in the worst resolution (9.33 Å) among the three. Similar results were obtained on EMPIAR 10345. The results confirm that the labeled particles in CryoPPP are of high quality and can be used to train/improve AI particle picking methods.
Table 2. Comparison of CryoPPP, Topaz, and crYOLO on two EMPIAR datasets. Bold font denotes the best resolution of the density map reconstructed from picked particles in the three trials.
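The possibility that a picker selects the same particle several times with slightly different centers can be checked with a simple distance test. The sketch below is not part of the CryoPPP pipeline; the function name, the greedy strategy, and the toy coordinates are assumptions made for illustration.

```python
import numpy as np

def drop_near_duplicates(coords, min_dist):
    """Greedily keep picks that are at least min_dist pixels away from every
    pick kept so far; return the indices of the kept picks.

    Quadratic in the number of picks, which is adequate for a single
    micrograph but not for a whole dataset at once.
    """
    coords = np.asarray(coords, dtype=float)
    kept = []
    for i, point in enumerate(coords):
        if all(np.linalg.norm(point - coords[j]) >= min_dist for j in kept):
            kept.append(i)
    return kept

# Hypothetical picks (x, y) in pixels on one micrograph.
picks = [(100, 100), (103, 101), (300, 300), (298, 299), (500, 120)]
unique_idx = drop_near_duplicates(picks, min_dist=20)
```

Here the second and fourth picks lie within a few pixels of earlier picks and are dropped. A sensible `min_dist` would be a fraction of the particle diameter, which differs between datasets.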
Cross validation with gold standard particles picked by the authors. Gold standard particles are those originally picked by the cryo-EM experts who generated the cryo-EM data. Only a few EMPIAR IDs deposited in EMPIAR have both the micrographs and the gold standard particles. To validate the accuracy of our picked particles, we compared our results with the existing gold standard particles that are publicly available through the EMPIAR website. We carried out 2D and 3D validation for EMPIAR-10345 54 and EMPIAR-10406 55 to validate our particle labelling process as follows.
2D particle class validation with gold standard. To obtain the gold standard 2D particles of each dataset, we downloaded from EMPIAR the particle stack image files (.mrc) and the .star file with the attributes of the picked particles. We used the particle stack and the star files to create the 2D classification results using CryoSPARC. We then compared our 2D class results with the gold standard. We performed the comparison based on the total number of classes, total number of picked particles, resolution, and visual orientation of the protein particle for each EMPIAR ID. Our results and the gold standard results exhibit strong correlations. It is worth noting that a high number of particles alone does not necessarily yield high resolution. Selecting a decent number of high-quality particles spanning a wide angular distribution is important for generating high 2D and 3D resolution. Figure 14 illustrates the 2D classification results for EMPIAR ID 10345 and EMPIAR ID 10406 published by the authors of the cryo-EM data and generated by us. They are consistent. Table 3 compares the 2D classification results generated by the original authors and by us. In both cases (Fig. 14A,B), the same 300 micrographs were used for comparison. On EMPIAR ID 10345, CryoPPP's results have substantially better resolution than the authors' results for both N = 50 and N = 10 classes. On EMPIAR-10406, CryoPPP's results have better resolution for N = 50 particle classes and comparable resolution for N = 10 particle classes.
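As a rough illustration of what the particle attributes in a .star file look like, the sketch below reads the first `loop_` table of STAR-formatted text in plain Python. It assumes a simplified layout (one table, whitespace-separated values, no quoted strings); the helper name and the example rows are hypothetical, and this is not the reader used by CryoSPARC or RELION.

```python
def read_star_loop(text):
    """Parse the first loop_ table of simplified STAR text into a list of
    dicts keyed by column label. Values are returned as strings."""
    labels, rows = [], []
    in_loop = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("data_"):
            if rows:          # a new data block ends the table we parsed
                break
            continue
        if not line or line.startswith("#"):
            if in_loop and rows and not line:
                break         # a blank line after data rows ends the table
            continue
        if line == "loop_":
            in_loop = True
        elif in_loop and line.startswith("_"):
            labels.append(line.split()[0])   # drop the trailing "#n" index
        elif in_loop:
            rows.append(dict(zip(labels, line.split())))
    return rows

# Hypothetical example; the column labels follow the RELION naming style.
example = """
data_

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnMicrographName #3
1024.0 2048.5 mic_001.mrc
 300.0  412.0 mic_001.mrc
"""
particles = read_star_loop(example)
xy = [(float(p["_rlnCoordinateX"]), float(p["_rlnCoordinateY"]))
      for p in particles]
```

Real star files can contain several data blocks and additional columns, so a dedicated STAR parsing library is preferable for anything beyond a quick inspection.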
3D density map validation with gold standard. The ab initio reconstruction of the 3D density map for EMPIAR 10345 and EMPIAR 10406 was carried out using the gold standard particles picked by the original authors and the ones picked by us, respectively. The coordinates of the particles picked by the original authors were downloaded from the EMPIAR website. Three different random seeds were used in the ab initio reconstruction to avoid random bias. The results of the 3D density maps, resolution, and direction of distribution obtained from the two kinds of particles are compared in Figs. 15, 16. The detailed comparison results are reported in Table 4. The 'loose mask' curve in the Fourier Shell Correlation (FSC) plots uses an automatically produced mask with a 15 Å falloff. The 'tight mask' curve employs an auto-generated mask with a falloff of 6 Å for all FSC plots.
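For readers unfamiliar with FSC curves, the following minimal sketch computes an unmasked FSC between two cubic volumes and reads off a resolution at a chosen threshold. It is an illustration under stated assumptions, not CryoSPARC's implementation: no loose or tight mask is applied, the function names are hypothetical, and the 0.143 default is the threshold commonly used for gold-standard FSC rather than a value stated in this section.

```python
import numpy as np

def fourier_shell_correlation(vol_a, vol_b):
    """Unmasked FSC per integer-radius shell for two cubic volumes of the
    same size. Returns an array of length n // 2 (shell 0 is the DC term)."""
    vol_a = np.asarray(vol_a, dtype=float)
    vol_b = np.asarray(vol_b, dtype=float)
    n = vol_a.shape[0]
    fa, fb = np.fft.fftn(vol_a), np.fft.fftn(vol_b)
    k = np.fft.fftfreq(n) * n                      # integer frequency indices
    kz, ky, kx = np.meshgrid(k, k, k, indexing="ij")
    shell = np.rint(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int).ravel()
    keep = shell < n // 2                          # ignore the cube corners

    def shell_sum(values):
        return np.bincount(shell[keep], weights=values.ravel()[keep],
                           minlength=n // 2)

    cross = shell_sum((fa * np.conj(fb)).real)
    power_a = shell_sum(np.abs(fa) ** 2)
    power_b = shell_sum(np.abs(fb) ** 2)
    return cross / np.sqrt(power_a * power_b)

def resolution_at_threshold(fsc, box_size, pixel_size, threshold=0.143):
    """Resolution (in Å) of the first non-DC shell where FSC < threshold,
    or None if the curve never drops below it."""
    below = np.nonzero(np.asarray(fsc) < threshold)[0]
    below = below[below > 0]
    if below.size == 0:
        return None
    return box_size * pixel_size / below[0]

# Sanity check on synthetic noise (not real half-maps).
rng = np.random.default_rng(0)
volume = rng.normal(size=(16, 16, 16))
fsc_same = fourier_shell_correlation(volume, volume)
fsc_inverted = fourier_shell_correlation(volume, -volume)
toy_resolution = resolution_at_threshold(
    np.array([1.0, 0.9, 0.5, 0.1, 0.0]), box_size=100, pixel_size=1.0
)
```

A volume compared with itself gives an FSC of 1 in every shell, and compared with its negative gives -1. In practice the two inputs are independently refined half-maps, and masking changes the curve, which is why loose-mask and tight-mask curves are reported separately.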
It is seen that CryoPPP outperforms the gold standard particles in terms of all the metrics on EMPIAR 10345 and achieves very similar results on EMPIAR 10406.
These rigorous validations with the gold standard picked particles and the comparison with the existing state-of-the-art AI methods demonstrate the high quality of the data in CryoPPP, which enables it to serve either as a benchmark dataset for comparing AI and classical methods of particle picking or as a training and test dataset for developing new methods in the field.

Code availability
The data analysis methods, software, and associated parameters used in this study are described in the Methods section. All the scripts associated with the various steps of data curation are available at the GitHub repository: https://github.com/BioinfoMachineLearning/cryoppp, which includes instructions on how to download the data.